Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nan_as_null not being respected when passing cudf object #14687

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Dec 29, 2023

Description

Fixes the following

In [1]: import cudf

In [2]: import pyarrow as pa

In [5]: cudf.Series(cudf.Series(pa.array([float("nan"), None]), nan_as_null=False), nan_as_null=True)
Out[5]: 
0     NaN <-- not converted to <NA>
1    <NA>
dtype: float64

Needs #14450 first

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Dec 29, 2023
@mroeschke mroeschke requested a review from a team as a code owner December 29, 2023 20:22
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small suggestion.

elif isinstance(arbitrary, cudf.Series):
data = arbitrary._column
column = arbitrary
if column.dtype.kind == "f" and (nan_as_null is None or nan_as_null):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I tend to prefer some parameter handling at the start of the function, such as:

if nan_as_null is None:
    nan_as_null = True

elif isinstance(arbitrary, cudf.Series):
data = arbitrary._column
column = arbitrary
if column.dtype.kind == "f" and (nan_as_null is None or nan_as_null):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke - The CI failure is a result of us hitting this check when we call compute() in the dask-cudf test.

When a dask collection is "computed", the partitions need to be concatenated together. It seems that calling cudf.concat uses the cudf.Series._concat, which in turn calls this as_column function with a default nan_as_null=True setting. This means that dask-cudf correctly uses the user-provided nan_as_null=False option when each partition is converted from pandas to cudf, but then the nans are lost when the partitions are concatenated together in compute.

One possible fix may be to modify this line to something like:

 return cls(data=col, index=index, name=name, nan_as_null=False)

However, I'm not immediately sure if there are other corner cases that this doesn't catch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I suppose the best way to test this outside of dask is to add a cudf.concat test that checks that existing nan values are preserved. In the case that there are both nan and null values, then we would probably need to raise an error or "promote" everything to null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insight! The Series._concat piece is what I was missing. I am going to see what breaks if nan_as_null is set to False. In theory, I would imagine nans and nulls should not be mangled together when merged together

rapids-bot bot pushed a commit that referenced this pull request Jan 11, 2024
Similar to #14687, nan was not being interpreted as null when `nan_as_null=True`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14688
@mroeschke
Copy link
Contributor Author

Looks like there's too many moving parts here when having nan_as_null applying toward cudf objects. Maybe it's just easier to ignore this keyword for them so closing

@mroeschke mroeschke closed this Jan 12, 2024
@mroeschke mroeschke deleted the bug/constr/nan_as_null_cudf_objs branch January 12, 2024 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants